
Introduction

With the increased scholarly interest in populism that political science has witnessed in the last two decades, the measurement of populism has presented itself as a major challenge to the discipline. Different instruments have been proposed to elicit measurements of populism (for review articles see Pauwels 2017; Bergman 2018; Hawkins et al. 2018), ranging from

This research note addresses a challenge to the content-analytical measurement of populism that has hitherto received little to no attention. Classical content-analytical measurement instruments present human coders with instances of text, which they are then asked to judge in accordance with a coding scheme (Krippendorff 2004). While researchers usually translate their theoretical concept of populism into a categorical coding scheme with the goal of eliciting correct measurements, measurement ultimately relies on the judgment of human coders. Human coders, irrespective of their domain-specific expertise and their level of prior training, are thus considered ‘noisy labelers’ in the statistical literature (Dawid and Skene 1979; Passonneau and Carpenter 2014; Guan et al. 2017). Coders’ varying abilities to correctly classify or rate instances of text thus generally constitute a source of error in content-analytical measurements of populism that cannot be eliminated by design.

Depending on the degree of human coders’ fallibility and the aggregation method used to obtain measurements at the level of instances, such coding error may impair measurement quality. This applies most specifically to conventional, non-parametric aggregation methods such as majority voting, which do not account for the possibility of agreement-in-error (i.e., a majority of coders agreeing on the false judgment; see Passonneau and Carpenter 2014). The goal of my analysis is thus to assess the variability and the degree of imperfection in human coders’ abilities to classify instances along the dimensions of a categorical coding scheme designed to measure populism in textual data.

To do so, I fit Bayesian annotation models to the human codings Hua, Abou-Chadi, and Barberá (2018) have collected on the crowd-sourcing platform CrowdFlower. Specifically, I fit the Beta-Binomial by Annotator (BBA) model proposed by Carpenter (2008), which allows estimating the prevalence of the positive class, items’ class membership, as well as coders’ individual specificities and sensitivities from the codings data. Assessing coders’ abilities in this ‘crowd’ of untrained human coders makes it possible to determine a lower benchmark on the aggregate measurement quality their judgments yield. The specific case I select for analysis thus constitutes a hard test for my argument: If model-based estimates of human coders’ abilities indicate close-to-perfect coder abilities, we would be less concerned that agreement-in-error causes erroneous classifications, and generally confident in the ability of humans to perform this particular content-analytical task. Likewise, if the model-based aggregation of noisy labels into instance-level measurements does not yield substantively different classifications than do non-parametric aggregation methods, the added value of obtaining estimates of coder abilities seems unjustified.

The remainder of this research note is structured as follows: First, I introduce the measurement instrument used by Hua, Abou-Chadi, and Barberá (2018) to obtain their crowd codings and describe the data. I then introduce the BBA model, discuss its notation, and show results from a simulation study demonstrating that its implementation in JAGS allows recovering simulated parameter values.

Data and empirical strategy

Crowd-sourced codings of populism in textual data

Hua et al. recruited crowd workers on the platform CrowdFlower to code social media posts created by a selected number of accounts of Western European parties and their leaders according to the following coding scheme:

  1. Filter questions:
    1. This post has no text or its content is impossible to understand.
    2. I understand the message of this social media post.
  2. Anti-elitism: Does this tweet/post criticize or mention in a negative way the elites?
  3. People-centrism: Does this tweet/post mention in a positive way or even praise the people (citizens of the country, the working class, the native …) or the nation?
  4. Exclusionism: Does this tweet/post criticize minorities or specific groups of people (muslims, jews, LGBT people, poor people …)?

Coders were asked to answer Yes or No to all questions; all judgments obtained are thus binary. If a coder answered question 1.1 affirmatively for a post, she was asked to skip the given post and proceed with judging the next one. If a coder answered question 1.2 affirmatively for a post, she was asked to proceed with answering questions 2–4 and then proceed with judging the next post.

Of particular interest are the judgments Hua et al. collected in an effort to ascertain the reliability of their crowd-sourced measurement instrument. Interested in computing inter-coder agreement metrics, Hua et al. crowd-sourced judgments from multiple coders for a set of social media posts (items). As the following table shows, each item was coded between one and four times.

| No. Judgments | No. Coders | \(N\) |
|--------------:|-----------:|------:|
| 1 | 1 | 507 |
| 2 | 2 | 89  |
| 3 | 3 | 902 |
| 4 | 4 | 2   |

Keeping all judgments that passed the first two filter questions,1 I was able to retain a total of 1489 items from the original validation data.

The Bayesian Beta-Binomial by Annotator model

I assume that anti-elitism, people-centrism, and exclusionism—the three dimensions of Hua et al.’s measurement instrument—are latent binary features of political posts, and hence crowd coders act as human content analysts whose judgments I want to aggregate at the item-level to estimate whether a given item belongs to either of these three categories.

For each dimension, the setup can be described as a four-tuple \(\langle\mathcal{I}, \mathcal{J}, \mathcal{K}, \mathcal{Y}\rangle\), where

  • \(\mathcal{I}\) is the set of items \(i \in 1,\ \ldots ,\ n\) distributed for crowd coding,
  • \(\mathcal{J}\) is the set of coders \(j \in 1,\ \ldots ,\ m\),
  • \(\mathcal{K}\) is the set of classes \(k \in \{0, 1\}\) defined by the categorical coding scheme used during crowd coding, and
  • \(\mathcal{Y}\) is the set of judgments (or codings) \(y_{i,j} \in \{0, 1\}\) recorded for item \(i\) by coder \(j\).

In total, we thus have at most \(d = n \times m\) judgments. In the validation data, it is guaranteed that \(\mathcal{Y}\) contains at least one judgment per item, that is, \(|\mathcal{Y}_i| \geq 1\ \forall\ i \in \mathcal{I}\), and that \(|\mathcal{Y}_i| \geq 2\ \forall\ i \in \mathcal{I}' \subset \mathcal{I}\). Moreover, the judgments for each item in \(\mathcal{I}'\) are generated by different coders (i.e., we have repeated annotation at the item level, but no coder judges the same item more than once).
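The setup above can be sketched in long format, one (item, coder, coding) triple per judgment; the records below are illustrative toy values, not the actual validation data:

```python
# A minimal sketch of the data structure <I, J, K, Y>: judgments are
# held as (i, j, y_ij) triples, from which per-item judgment sets Y_i
# can be recovered. Items and codings here are made up for illustration.
from collections import defaultdict

judgments = [  # (item i, coder j, coding y_ij)
    (1, 1, 0),
    (2, 1, 1), (2, 2, 1), (2, 3, 0),  # item 2 is in I' (repeated annotation)
    (3, 4, 0), (3, 5, 0),
]

per_item = defaultdict(list)
for i, j, y in judgments:
    per_item[i].append((j, y))

# every item has at least one judgment ...
assert all(len(v) >= 1 for v in per_item.values())
# ... and within an item, all judgments come from distinct coders
assert all(len({j for j, _ in v}) == len(v) for v in per_item.values())
```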

Importantly, while coders’ judgments of items are observed, items’ true classes \(c_i \in \mathcal{K}\) are a priori unknown for all \(i = 1,\ \ldots,\ n\). In this setup, a classification of items into classes obtained from a set of judgments, \(\rho(\mathcal{Y}) \Rightarrow \mathcal{C}\), is called a ground-truth labeling. I obtain ground-truth class estimates by fitting the following model to the judgment data:

\[ \begin{align*} c_i &\sim\ \mbox{Bernoulli}(\pi)\\ \theta_{0j} &\sim\ \mbox{Beta}(\alpha_0 , \beta_0)\\ \theta_{1j} &\sim\ \mbox{Beta}(\alpha_1 , \beta_1)\\ y_{ij} &\sim\ \mbox{Bernoulli}(c_i\theta_{1j} + (1 - c_i)(1 - \theta_{0j}))\\ {}&{}\\ \pi &\sim\ \mbox{Beta}(1,1)\\ \alpha_0/(\alpha_0 + \beta_0) &\sim\ \mbox{Beta}(1,1)\\ \alpha_0+\beta_0 &\sim\ \mbox{Pareto}(1.5)\\ \alpha_1/(\alpha_1 + \beta_1) &\sim\ \mbox{Beta}(1,1)\\ \alpha_1+\beta_1 &\sim\ \mbox{Pareto}(1.5) \end{align*} \] where

  • \(c_i\) is the ‘true’ (unobserved) class of statement \(i\),
  • \(\pi\) is the ‘true’ prevalence of the positive class,
  • \(\theta_{0,j}\) is coder \(j\)’s specificity (true-negative rate), and
  • \(\theta_{1,j}\) is her sensitivity (true-positive rate).

Carpenter (2008) refers to this model as the Beta-Binomial by Annotator (BBA) model. This name is due to its property that, given a conjugate beta prior, the posterior densities of items’ class membership follow a beta-binomial distribution. In Carpenter’s original formulation, all priors are chosen to be uninformative, as we often have no domain-specific prior knowledge about items’ classes, coders’ abilities, and the prevalence of the positive class. Note that the distributions of coder abilities are parameterized in terms of the mean and scale of the Beta distribution. Specifically, I choose a uniform prior for the mean \(\alpha/(\alpha+\beta)\) and a uniform prior for the inverse square scale \(1/(\alpha+\beta)^2\). As Carpenter explicates, the prior on the means is conveniently expressed using a Beta distribution with \(a = b = 1\), whereas the flat prior on the inverse square scale is expressed as a Pareto prior with \(\alpha = 1.5, c=1\).

Let us first have a look at how the mean and scale parameters are distributed under these priors.

These priors can also be inspected with regard to the priors they induce on \(\alpha\) and \(\beta\), respectively. To do so, we sample 1000 pairs from these distributions and compute \(\alpha^{sim}\) and \(\beta^{sim}\) values from each pair.

Finally, I use the sampled \(\alpha^{sim}\) and \(\beta^{sim}\) values to obtain draws from correspondingly parameterized Beta distributions. This yields the following empirical distribution:
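The prior simulation just described can be sketched as follows (a minimal numpy version; the seed and sample size are illustrative):

```python
# Sketch of the prior simulation: draw the Beta mean from Beta(1,1)
# and the scale (alpha + beta) from Pareto(1.5) with lower bound c = 1,
# recover alpha^sim and beta^sim, and draw ability values from the
# implied Beta distributions.
import numpy as np

rng = np.random.default_rng(1)
n_sim = 1000

mean_sim = rng.beta(1, 1, n_sim)                # mean = alpha / (alpha + beta)
scale_sim = rng.pareto(1.5, n_sim) + 1.0        # Pareto with alpha = 1.5, c = 1
alpha_sim = mean_sim * scale_sim
beta_sim = scale_sim - alpha_sim

theta_sim = rng.beta(alpha_sim, beta_sim)       # implied prior draws of abilities

# small scales (alpha + beta < 2) yield U-shaped Beta densities, which is
# why the implied prior places much mass near 0 and 1
share_extreme = np.mean((theta_sim < 0.1) | (theta_sim > 0.9))
```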

Essentially, the prior pushes coders’ ability parameters toward the extremes.

Simulated data

In order to demonstrate that the model can recover true parameter values, I simulated judgment data for \(n = 1000\) items and \(m = 20\) coders with the following parameters: \(\pi = 0.2\), \(\alpha_0 = 40\), \(\beta_0 = 8\), \(\alpha_1 = 20\), \(\beta_1 = 8\). In addition, to reflect the incomplete panel design of the validation data (i.e., not all items are judged by all coders), I set a missingness rate of 0.5. Thus, for each item, I simulate judgments by ten randomly selected coders.

Annotators’ specificity and sensitivity parameters \(\theta_{j0}, \theta_{j1}\) are drawn from \(f_{\text{Beta}}(\alpha_0, \beta_0)\) with \((\alpha_0, \beta_0) = (40, 8)\) and \(f_{\text{Beta}}(\alpha_1, \beta_1)\) with \((\alpha_1, \beta_1) = (20, 8)\), respectively. The probability density functions (PDFs) of the \(\theta_{j\cdot}\) parameters look as follows:
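The simulation design above can be sketched as follows (the seed is arbitrary; this mirrors the generative process of the BBA model rather than reproducing the exact simulated data):

```python
# Sketch of the simulation: n = 1000 items, m = 20 coders, prevalence
# pi = 0.2, specificities from Beta(40, 8), sensitivities from
# Beta(20, 8), and a missingness rate of 0.5 (each item is judged by
# 10 randomly selected coders).
import numpy as np

rng = np.random.default_rng(42)
n, m, pi = 1000, 20, 0.2

c = rng.binomial(1, pi, n)       # true item classes
theta0 = rng.beta(40, 8, m)      # coder specificities (true-negative rates)
theta1 = rng.beta(20, 8, m)      # coder sensitivities (true-positive rates)

judgments = []                   # long format: (item, coder, coding)
for i in range(n):
    coders = rng.choice(m, size=m // 2, replace=False)  # 50% missingness
    for j in coders:
        # a positive judgment arises with prob. theta1 for positive items
        # and with prob. 1 - theta0 for negative items
        p_positive = theta1[j] if c[i] == 1 else 1 - theta0[j]
        judgments.append((i, j, rng.binomial(1, p_positive)))
```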

Descriptives

Taking 20 independent draws from each distribution, we obtain coders’ ability parameters. As the following plot illustrates, coders’ ‘true’ abilities scatter nicely, so that we have different coder types.2

Given the simulated codings, we first want to inspect what coder-specific true-positive and true-negative rates we observe in the simulated data.

We can see that the simulated ‘observed’ coder-specific specificities and sensitivities fall in the ranges [0.735, 0.924] and [0.49, 0.867], respectively, and come close to the ‘true’ parameter values.

Inspecting the BBA model fit

I have implemented the BBA model in JAGS,3 and obtained three MCMC chains with 500 burn-in and 1000 iterations each.4

First, we want to see whether chains mix and estimates converge. Due to the abundance of parameters in the BBA model, I use the deviance information criterion (DIC) to assess mixing and convergence.

Moreover, plotting the shrink factor against iterations indicates that the model converged after only a few iterations.5 We also find only very limited autocorrelation at the first and second lags, so there is apparently no need for thinning.6
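The shrink factor tracked in such plots is the Gelman-Rubin potential scale reduction factor; a minimal numpy sketch of the statistic (an illustrative implementation, not the diagnostic code used for the models reported here):

```python
# Gelman-Rubin potential scale reduction ('shrink') factor for one
# scalar parameter, computed from several chains; a Gelman plot tracks
# this quantity over iterations.
import numpy as np

def psrf(chains):
    """Potential scale reduction factor; chains has shape (n_chains, n_iter)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    w = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    b = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * w + b / n          # pooled variance estimate
    return np.sqrt(var_hat / w)

# three chains sampling the same target should give a PSRF close to 1
rng = np.random.default_rng(0)
well_mixed = rng.normal(0.0, 1.0, size=(3, 1000))
print(psrf(well_mixed))  # close to 1
```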

Plotting posterior estimates of \(\pi\), we see that they mix and converge nicely as well:

Inspecting mixing and convergence for all posterior estimates of coder abilities is not feasible, since we have 20 posterior densities for each of \(\theta_0\) and \(\theta_1\). Instead, we directly plot the marginal posterior densities by coder and parameter: By and large, posterior ability estimates come close to the true parameter values. However, for specificities we can see that we overestimate (underestimate) ability at comparatively low (high) specificity values. In fact, as the next plot illustrates, estimates and true specificity values are negatively correlated, whereas for sensitivity parameters we obtain estimates that are weakly positively correlated with the true values.

This result is surprising, since I chose essentially the same simulation parameter values as Carpenter (2008, Subsection 3.2) but cannot reproduce his Figure 9 (which demonstrates a strong positive correlation between simulated and posterior mean values).

As a consequence of the model’s limited ability to recover true parameter values, its accuracy in classifying items, at 0.682, is quite limited, too.

A look at the estimates of the hyperparameters \(\alpha_0, \beta_0\) and \(\alpha_1, \beta_1\), respectively, suggests that this does not necessarily reflect a general failure of the model to recover simulated parameter values. Indeed, this plot is very similar to Carpenter’s Figure 11.

Analysis

Estimation

Before turning to the analysis of coder abilities in people-centrism, anti-elitism, and exclusionism classification, respectively, I fit the models and report convergence metrics.

People-centrism

We begin with the first dimension of the measurement instrument: people-centrism. In this context, the positive class unites posts that feature people-centrist statements. From studies in other domains (news articles, speeches), we expect the prevalence not to exceed 40%. With regard to coders’ abilities, we expect most coders to be non-adversarial (i.e., their judgments are not negatively correlated with item classes), as crowd workers were allowed to participate only if they successfully completed eight out of ten initial gold screening tasks. As these beliefs are, however, not supported by domain-specific data, I decided to go with uninformative priors.

I obtain MCMC estimates using JAGS with three chains, 1000 burn-in iterations, and 40K iterations with thinning parameter set to 20. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.

Inspecting the DIC confirms that the model converged and chains are well mixed. Using these fitting parameters yields a shrink factor that declines sharply within the first post-burn-in iterations and is very close to 1 thereafter. Moreover, with the thinning parameter set to 20, autocorrelation in posterior estimates is negligible.7

Anti-elitism

Next, I fit a BBA model to the crowd-sourced validation data for anti-elitism classification. I again use uninformative priors. Specifically, I obtain MCMC estimates using JAGS with three chains, 5K burn-in iterations, and 15K iterations with the thinning parameter set to 15. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning. Judging by the DIC, all chains mix nicely and converge quickly. And with the thinning parameter set to 15, we effectively reduce autocorrelation to tolerable levels.8

Exclusionism

Lastly, I obtain posterior estimates of a BBA model fitted to crowd coders’ judgments of posts regarding their class membership in the exclusionism category. I obtained three chains, 10K burn-in iterations, and 100K iterations with the thinning parameter set to 50. Again, these choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.

Judging by the DIC, there is some drift in DIC values that levels off only after the first 50K iterations, when chains start to mix nicely. Accordingly, the shrink factor approaches one only after some tens of thousands of iterations. What is more, even with the thinning parameter set to 50, we still have substantial autocorrelation. The estimates obtained by fitting the BBA model to the exclusionism judgments are thus to be taken with a grain of salt.9

Evaluation

Coder abilities

As motivated in the introduction, the primary goal of this research note is to assess coder abilities. With items’ true classes unknown and no prior knowledge about coders’ abilities, the Bayesian BBA model offers a well-suited tool to scrutinize this question, as it allows estimating the sensitivities and specificities of the coders who have contributed their judgments of items’ class membership.

People-centrism

With regard to coders’ classification of non-people-centrist items, the picture is relatively homogeneous: Coders are generally found to be highly specific, that is, they perform well in correctly classifying non-people-centrist items.10 The mass of most posterior densities of the \(\theta_{0\cdot}\) parameters lies in the range \([.75,1)\). With regard to coders’ true-positive detection abilities, there are some outliers with sensitivities in the range \([.4, .6]\) (e.g., coders 7–9, 17, 38, and 39) and even substantial posterior density mass below .5 (specifically coder 32). Hence, the distribution of posterior means is more dispersed for sensitivities than for specificities:

The validation data thus give reason to believe that the sampled coders are somewhat heterogeneous in terms of their classification abilities, at least with regard to sensitivities. This conclusion is supported when looking at the posterior distributions of the hyperparameters of the sensitivity and specificity distributions:

Only \(\beta_0\), the second shape parameter of the specificity hyperdistribution, can be estimated with comparatively high precision. Take, for instance, the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\): 80% of their values lie in the ranges \(\alpha_1 \in\) [1.493, 12.435] and \(\beta_1 \in\) [0.678, 6.827]. Due to the flexibility of the Beta distribution into which these hyperparameters feed, we get differently shaped posterior densities depending on the selected quantile values, as the next figure illustrates:

We can thus conclude that the mass of crowd coders were very specific and overwhelmingly non-adversarial but performed in part only mediocrely in classifying true-positive items.

Anti-elitism

Inspecting posterior estimates of coders’ sensitivity and specificity parameters, the picture is similar to that for people-centrism classification: Coders are generally highly specific, yet the sampled coders are more heterogeneous with regard to their abilities to correctly classify positive items, as the following figure shows:

Having specified uninformative priors, the validation data thus give reason to believe that the coder population is somewhat heterogeneous in terms of true-positive classification abilities, but more often than not non-adversarial and better than chance. Again, this is confirmed when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:

Compared to the hyperparameter estimates for people-centrism classification, densities are less dispersed, with the minor exception of \(\alpha_0\). Take, for instance, the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\): 80% of their values lie in the ranges \(\alpha_1 \in\) [1.133, 3.81] and \(\beta_1 \in\) [0.394, 1.413]. We can again inspect the shape of the posterior hyperparameter distributions at selected quantile values.

With above-median hyperparameter values, however, posterior ability distributions have the vast share of their mass on non-adversarial values (i.e., > .5), and again, we have reason to believe that coders are both highly specific and, though somewhat less so, sensitive.

Exclusionism

Finally, we are interested in the distribution of coder abilities in exclusionism classification.

Inspecting posterior estimates of coders’ sensitivity and specificity parameters, we get a relatively clear-cut and familiar picture. Posterior estimates of coders’ sensitivities are virtually all non-adversarial, and the mass of the posterior densities lies in regions that indicate better-than-chance classification abilities.

Posterior estimates of coders’ specificities are suspicious, however. Indeed, all coders’ posterior specificity densities are concentrated heavily on values close to 1 (i.e., perfect true-negative detection abilities).11 Plotting the distribution of coders’ mean posterior ability parameter estimates underlines this result:

Given our doubts about the quality of the posterior estimates due to the slow convergence and high autocorrelation in chains, I refrain from interpreting these results as an indication of coders’ high (mediocre) true-negative (true-positive) classification abilities.

Agreement between model- and majority-voting-based class assignments

While we have reason to doubt the quality of the posterior estimates of the BBA model fitted to crowd-sourced exclusionism classifications, the posterior estimates of coders’ abilities obtained for their classification of posts into people-centrist and anti-elitist instances can be interpreted to indicate that coders performed overwhelmingly well in detecting true-negative instances, but less well in detecting true-positive instances. Given this evidence of positivity bias in coders’ classification abilities (they tend to err when classifying items as positive, and are often right when classifying items as negative), I expect that aggregating judgments by determining the majority-winner label induces a positive bias at the aggregate level, since majority voting does not account for coders’ fallibility.
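The contrast between the two aggregation rules can be sketched as follows (hypothetical values, not the validation data):

```python
# Majority voting treats every coder as equally reliable, so a single
# fallible positive judgment can carry an item into the positive class;
# a model-based rule thresholds the posterior probability of class
# membership instead. Both functions are illustrative sketches.
import numpy as np

def majority_vote(codings):
    """Positive class wins if strictly more than half the codings are 1
    (ties go to the negative class)."""
    codings = np.asarray(codings)
    return int(codings.sum() * 2 > len(codings))

def model_based(posterior_prob, threshold=0.5):
    """Classify by thresholding the posterior probability of class 1."""
    return int(posterior_prob > threshold)

# an item judged once, by a coder of unknown reliability: majority
# voting takes the single 'Yes' at face value ...
assert majority_vote([1]) == 1
# ... while a model that has learned this coder's fallibility may put
# little posterior mass on the positive class
assert model_based(0.31) == 0
```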

(Dis)Agreement of model-based posterior classification and majority voting for people-centrism classification.

| Agree | Posterior Classification | \(n_i\) | \(N\) | Proportion |
|:------|-------------------------:|--------:|------:|-----------:|
| no  | 0 | 1 | 33  | 0.033 |
| no  | 0 | 2 | 2   | 0.002 |
| no  | 0 | 3 | 5   | 0.005 |
| no  | 1 | 3 | 1   | 0.001 |
| yes | 0 | 1 | 446 | 0.446 |
| yes | 0 | 2 | 94  | 0.094 |
| yes | 0 | 3 | 788 | 0.788 |
| yes | 0 | 4 | 2   | 0.002 |
| yes | 1 | 1 | 34  | 0.034 |
| yes | 1 | 2 | 8   | 0.008 |
| yes | 1 | 3 | 76  | 0.076 |

Indeed, there are in total only 41 out of 1489 items (i.e., 2.8%) for which model-based and majority-voting classifications disagree. The vast share of this disagreement results from items that are classified as featuring people-centrism under majority voting but not under BBA model-based aggregation (40 items). Hence, there is evidence of positive bias in majority-voting-based classifications. Importantly, this disagreement occurs most often where only one coder judged an item. As a consequence of these differences, the empirical prevalence (not to be confused with \(\pi\)) differs somewhat between classification methods: 0.158 in the case of majority voting vs. 0.119 for model-based classification.

Though the same line of reasoning could be applied, we find no evidence of positive bias in majority-winner classifications of items into the anti-elitism class:

(Dis)Agreement of model-based posterior classification and majority voting for anti-elitism classification.

| Agree | Posterior Classification | \(n_i\) | \(N\) | Proportion |
|:------|-------------------------:|--------:|------:|-----------:|
| no  | 0 | 3 | 5   | 0.005 |
| no  | 1 | 2 | 1   | 0.001 |
| no  | 1 | 3 | 24  | 0.024 |
| yes | 0 | 1 | 380 | 0.380 |
| yes | 0 | 2 | 89  | 0.089 |
| yes | 0 | 3 | 612 | 0.612 |
| yes | 1 | 1 | 133 | 0.133 |
| yes | 1 | 2 | 14  | 0.014 |
| yes | 1 | 3 | 229 | 0.229 |
| yes | 1 | 4 | 2   | 0.002 |

Indeed there are in total only 30 out of 1489 items for which model-based and majority-voting classifications disagree.

Conclusion

The goal of this research note was (i) to estimate the degree of imperfection in human coders’ abilities to classify political texts into people-centrist, anti-elitist, and exclusionist instances, respectively, and (ii) to assess whether classifications of items obtained by aggregating the judgments provided by imperfect (i.e., noisy) coders using majority voting differ substantially from those obtained by fitting a Bayesian annotation model that accounts for human coding error to the same data. The analysis presented in this research note has yielded mixed results. Posterior estimates of coders’ true-negative classification abilities were generally found to be close to perfect (though the estimates obtained for exclusionism classification are less trustworthy due to poor convergence). Estimates of true-positive classification abilities, in turn, indicate that coders performed more poorly in detecting truly people-centrist, anti-elitist, and exclusionary texts.

By and large, however, these coder-level imperfections have little impact on item-level classifications: For people-centrism classification, there are in total only 41 out of 1489 items for which model-based and majority-voting classifications disagree (with model-based classifications being overwhelmingly negative); and for anti-elitism, there is disagreement in only 30 classifications (with model-based classifications being overwhelmingly positive).

One may thus conclude that the added value of fitting BBA models or another type of annotation model to obtain posterior classifications of political texts is limited in the case of populism measurement. This conclusion is supported by the fact that the model fitted to coders’ judgments for exclusionism had difficulty yielding high-quality posterior estimates.

The last word in this debate is not spoken, however, as the number of judgments per item in the validation data ranged between only one and four. As Benoit et al. (2016) demonstrate in another political science application, aggregating higher numbers of crowd-sourced judgments at the item level tends to yield better posterior classification quality, an expectation that is supported by both statistical reasoning and evidence from simulation studies (Snow et al. 2008; Hsueh, Melville, and Sindhwani 2009; Guan et al. 2017).

References

Aslanidis, Paris. 2018. “Populism as a Collective Action Master Frame for Transnational Mobilization.” Sociological Forum 33 (2): 443–64. https://doi.org/10.1111/socf.12424.

Benoit, Kenneth, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov. 2016. “Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political Data.” American Political Science Review 110 (02): 278–95. https://doi.org/10.1017/S0003055416000058.

Bergman, Matthew E. 2018. “Quantitative Measures of Populism: A Survey.” SSRN Scholarly Paper ID 3175536. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=3175536.

Carpenter, Bob. 2008. “Multilevel Bayesian Models of Categorical Data Annotation.” Unpublished manuscript. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.1374&rep=rep1&type=pdf.

Dai, Yaoyao. 2019. “Measuring Populism in Context: A Supervised Approach with Word Embedding Models.”

Dawid, Alexander Philip, and Allan M Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Applied Statistics 28 (1): 20–28. https://doi.org/10.2307/2346806.

Ernst, Nicole, Sven Engesser, Florin Büchel, Sina Blassnig, and Frank Esser. 2017. “Extreme Parties and Populism: An Analysis of Facebook and Twitter Across Six Countries.” Information, Communication & Society 20 (9): 1347–64. https://doi.org/10.1080/1369118X.2017.1329333.

Guan, Melody Y, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. 2017. “Who Said What: Modeling Individual Labelers Improves Classification.” arXiv Preprint arXiv:1703.08774.

Hawkins, Kirk A. 2009. “Is Chávez Populist?: Measuring Populist Discourse in Comparative Perspective.” Comparative Political Studies 42 (8): 1040–67. https://doi.org/10.1177/0010414009331721.

Hawkins, Kirk A., Ryan E. Carlin, Levente Littvay, and Cristóbal Rovira Kaltwasser. 2018. The Ideational Approach to Populism: Concept, Theory, and Analysis. Routledge.

Hawkins, Kirk A, and Bruno Castanho Silva. 2018. “Textual Analysis.” In The Ideational Approach to Populism: Concept, Theory, and Analysis, edited by Kirk A. Hawkins, Ryan E. Carlin, Levente Littvay, and Cristóbal Rovira Kaltwasser. Routledge.

Hsueh, Pei-Yun, Prem Melville, and Vikas Sindhwani. 2009. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria.” In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, 27–35. Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1564131.1564137.

Hua, Whitney, Tarik Abou-Chadi, and Pablo Barberá. 2018. “Networked Populism: Characterizing the Public Rhetoric of Populist Parties in Europe.” In. Paper prepared for the 2018 MPSA Conference.

Jagers, Jan, and Stefaan Walgrave. 2007. “Populism as Political Communication Style: An Empirical Study of Political Parties’ Discourse in Belgium.” European Journal of Political Research 46 (3): 319–45. https://doi.org/10.1111/j.1475-6765.2006.00690.x.

Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology. 2nd ed. Thousand Oaks, Calif: Sage.

March, Luke. 2018. “Textual Analysis.” In The Ideational Approach to Populism: Concept, Theory, and Analysis, edited by Kirk A. Hawkins, Ryan E. Carlin, Levente Littvay, and Cristóbal Rovira Kaltwasser. Routledge.

Oliver, J Eric, and Wendy M Rahn. 2016. “Rise of the Trumpenvolk: Populism in the 2016 Election.” The ANNALS of the American Academy of Political and Social Science 667 (1): 189–206.

Passonneau, Rebecca J, and Bob Carpenter. 2014. “The Benefits of a Model of Annotation.” Transactions of the Association for Computational Linguistics 2: 311–26.

Pauwels, Teun. 2011. “Measuring Populism: A Quantitative Text Analysis of Party Literature in Belgium.” Journal of Elections, Public Opinion & Parties 21 (1): 97–119. https://doi.org/10.1080/17457289.2011.539483.

———. 2017. “Measuring Populism: A Review of Current Approaches.” In Political Populism: A Handbook, edited by Reinhard C. Heinisch, Christina Holtz-Bacha, and Oscar Mazzoleni, 123–36. Nomos.

Polk, Jonathan, Jan Rovny, Ryan Bakker, Erica Edwards, Liesbet Hooghe, Seth Jolly, Jelle Koedam, et al. 2017. “Explaining the Salience of Anti-Elitism and Reducing Political Corruption for Political Parties in Europe with the 2014 Chapel Hill Expert Survey Data.” Research & Politics 4 (1): 1–9. https://doi.org/10.1177/2053168016686915.

Rooduijn, Matthijs, Sarah L de Lange, and Wouter van der Brug. 2014. “A Populist Zeitgeist? Programmatic Contagion by Populist Parties in Western Europe.” Party Politics 20 (4): 563–75. https://doi.org/10.1177/1354068811436065.

Rooduijn, Matthijs, and Teun Pauwels. 2011. “Measuring Populism: Comparing Two Methods of Content Analysis.” West European Politics 34 (6): 1272–83. https://doi.org/10.1080/01402382.2011.616665.

Snow, Rion, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. “Cheap and Fast—but Is It Good?: Evaluating Non-Expert Annotations for Natural Language Tasks.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 254–63. Association for Computational Linguistics.


  1. These were all items for which the meta variable filter had the value ‘ok’.

  2. That is, individual coders combine different capabilities ranging from low to high specificity as well as from low to high sensitivity, respectively.

  3. The complete JAGS code for the model reads as follows:

    model{
    # latent true class of each item
    for (i in 1:N){
        c[i] ~ dbern(pi)
    }
    
    # coder-specific specificities (theta0) and sensitivities (theta1)
    for (j in 1:M) {
        theta0[j] ~ dbeta(alpha0, beta0);
        theta1[j] ~ dbeta(alpha1, beta1);
    }
    
    # each judgment y[j,3] of item y[j,1] by coder y[j,2]
    for (j in 1:J) {
        y[j,3] ~ dbern(c[y[j,1]]*theta1[y[j,2]]+(1-c[y[j,1]])*(1-theta0[y[j,2]]))
    }
    
    # uniform prior on the prevalence of the positive class
    pi ~ dbeta(1,1);
    
    # hyperpriors, parameterized via the mean and scale of the Beta distribution
    mean0 ~ dbeta(1,1);
    scale0 ~ dpar(1.5,1);
    alpha0 <- mean0*scale0;
    beta0 <- scale0-alpha0;
    
    mean1 ~ dbeta(1,1);
    scale1 ~ dpar(1.5,1);
    alpha1 <- mean1*scale1;
    beta1 <- scale1-alpha1;
    }
  4. Updating and fitting the model took 1.381 minutes on my MacBook Pro with a 3.5 GHz Intel Core i7 Processor using one core.

  5. Gelman plot of BBA model fitted to simulated codings data:

  6. Autocorrelation in subsequent estimates for all three chains of BBA model fitted to simulated codings data:

  7. Gelman and autocorrelation plots for BBA model fitted to crowd-sourced people-centrism classifications in Hua et al.’s validation data (3 MCMC chains with 1000 burn-in iterations, and 40K iterations with thinning parameter set to 20):

  8. Posterior estimates obtained from fitting BBA model to crowdsourced anti-elitism data:

  9. Posterior estimates obtained from fitting BBA model to crowd-sourced exclusionism classification data:

  10. Note that I have post-hoc transformed the parameter assignment obtained by fitting the BBA model, because the mean posterior estimate of \(\pi\), the prevalence of people-centrism in social media posts, was unreasonably high, indicating that all three chains converged on the reverse parameter assignment. Convergence on the reverse assignment is a phenomenon that results from the non-identifiability of the BBA model (Carpenter 2008, 7). Post-hoc transformation recovers the correct parameter assignment \(\mathcal{P} = \left(\{c_i\}_{i\in 1, \ldots, n}, \pi, \{\theta_{j0}\}_{j\in\,1, \ldots, m}, \{\theta_{j1}\}_{j\in\,1, \ldots, m}, \alpha_0, \beta_0, \alpha_1, \beta_1 \right)\) from the reversed assignment \(\mathcal{P}' = \left( \{1-c_i\}_{i\in 1, \ldots, n}, 1-\pi, \{1-\theta_{j1}\}_{j\in\,1, \ldots, m}, \{1-\theta_{j0}\}_{j\in\,1, \ldots, m}, \beta_1, \alpha_1, \beta_0, \alpha_0 \right)\). In comparison to \(\mathcal{P}\), the reversed assignment \(\mathcal{P}'\) obtains \(c_i' = 1- c_i\), reflects prevalence estimates around 0.5, and swaps and reflects sensitivity and specificity parameters around 0.5 (Carpenter 2008, 7f.).
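  This post-hoc transformation can be sketched in Python, assuming posterior draws are held in a dict keyed by parameter name (a hypothetical container, not the actual fitting code):

```python
# Map the reversed parameter assignment P' back to P: reflect classes,
# prevalence, and abilities around their complements, and swap the
# sensitivity and specificity blocks together with their hyperparameters.
import numpy as np

def flip_assignment(draws):
    """draws: dict of posterior draws under the reversed assignment."""
    return {
        "c": 1 - draws["c"],
        "pi": 1 - draws["pi"],
        "theta0": 1 - draws["theta1"],   # specificity <- reflected sensitivity
        "theta1": 1 - draws["theta0"],   # sensitivity <- reflected specificity
        "alpha0": draws["beta1"], "beta0": draws["alpha1"],
        "alpha1": draws["beta0"], "beta1": draws["alpha0"],
    }
```

Note that the transformation is an involution: applying it twice recovers the original assignment, which is a convenient sanity check.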

  11. While this may be in parts due to the very low prevalence of exclusionism instances (the prevalence is 0.097) that implies that coders will almost always be correct if they classify an item as negative, this may also be due to the poor convergence of the model (as discussed above).